Name(s): Andrew Zhao, Yiheng Yuan
Website Link: https://asdacdsfca.github.io/LoL_Model/
import pandas as pd
import numpy as np
import os
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
Prediction Question: Predict whether a player's position is support given their post-game data.
Type: Classification
We chose classification because our prediction target is categorical. Classification is a predictive modeling task that assigns each sample a discrete output, such as a label or category.
Since we only predict whether the position is "support," each sample receives exactly one label from two mutually exclusive classes, true or false.
We chose this target because our prediction question implies that we want to predict players' roles, which are the values in the position column.
We chose accuracy over other suitable metrics because we are most interested in how "good" our predictions are compared to the actual results.
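One caveat worth keeping in mind when reading the accuracy scores later: each game has one support among five players, so roughly 20% of rows are the positive class. A hedged sketch (the toy labels below are hypothetical, not from the dataset) of the majority-class baseline any useful model must beat:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical illustration: 1 support per 5 players, so ~20% positives
y_true = np.array([1, 0, 0, 0, 0] * 20)  # 100 toy labels, 20 supports

# A trivial classifier that always predicts "not support" already
# reaches 80% accuracy; a useful model must clear this baseline
always_negative = np.zeros_like(y_true)
baseline = accuracy_score(y_true, always_negative)
print(baseline)  # 0.8
```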
Get the original dataset:
lol = pd.read_csv('2022_LoL_esports_match_data_from_OraclesElixir.csv')
lol
| gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | ... | opp_csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 | opp_killsat15 | opp_assistsat15 | opp_deathsat15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 121.0 | 391.0 | 345.0 | 14.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 100.0 | 541.0 | -275.0 | -11.0 | 2.0 | 3.0 | 2.0 | 0.0 | 5.0 | 1.0 |
| 2 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 119.0 | -475.0 | 153.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2.0 |
| 3 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 149.0 | -793.0 | -1343.0 | -34.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 0.0 |
| 4 | ESPORTSTMNT01_2690210 | complete | NaN | LCK CL | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 21.0 | 443.0 | -497.0 | 7.0 | 1.0 | 2.0 | 2.0 | 0.0 | 6.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149227 | 9687-9687_game_5 | partial | https://lpl.qq.com/es/stats.shtml?bmid=9687 | DC | 2022 | NaN | 0 | 2022-12-27 12:43:43 | 5 | 12.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 149228 | 9687-9687_game_5 | partial | https://lpl.qq.com/es/stats.shtml?bmid=9687 | DC | 2022 | NaN | 0 | 2022-12-27 12:43:43 | 5 | 12.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 149229 | 9687-9687_game_5 | partial | https://lpl.qq.com/es/stats.shtml?bmid=9687 | DC | 2022 | NaN | 0 | 2022-12-27 12:43:43 | 5 | 12.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 149230 | 9687-9687_game_5 | partial | https://lpl.qq.com/es/stats.shtml?bmid=9687 | DC | 2022 | NaN | 0 | 2022-12-27 12:43:43 | 5 | 12.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 149231 | 9687-9687_game_5 | partial | https://lpl.qq.com/es/stats.shtml?bmid=9687 | DC | 2022 | NaN | 0 | 2022-12-27 12:43:43 | 5 | 12.23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
149232 rows × 123 columns
lol_copy = lol.copy()
Filter out the rows where position equals "team" so that we are only looking at player-level data:
lol_cleaned = lol_copy.loc[lol['position']!='team', :]
Select the columns needed for the prediction question:
lol_cleaned = lol_cleaned[['patch', 'champion','position','kills', 'deaths', 'assists' ,'dpm', 'damageshare',
'damagetakenperminute', 'vspm', 'earned gpm', 'cspm']]
Identify all the rows that contains NaN values:
lol_cleaned.loc[lol_cleaned['dpm'].isna()]
| patch | champion | position | kills | deaths | assists | dpm | damageshare | damagetakenperminute | vspm | earned gpm | cspm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17868 | 12.02 | Gwen | top | 2 | 3 | 3 | NaN | NaN | NaN | NaN | 274.9249 | 9.3093 |
| 17869 | 12.02 | Lee Sin | jng | 2 | 4 | 3 | NaN | NaN | NaN | NaN | 135.8258 | 3.6036 |
| 17870 | 12.02 | Lissandra | mid | 1 | 3 | 1 | NaN | NaN | NaN | NaN | 210.1802 | 8.4685 |
| 17871 | 12.02 | Jhin | bot | 0 | 4 | 3 | NaN | NaN | NaN | NaN | 177.9880 | 7.4474 |
| 17872 | 12.02 | Yuumi | sup | 1 | 4 | 5 | NaN | NaN | NaN | NaN | 74.7748 | 0.9309 |
| 17873 | 12.02 | Tryndamere | top | 1 | 3 | 5 | NaN | NaN | NaN | NaN | 248.4985 | 8.4384 |
| 17874 | 12.02 | Jarvan IV | jng | 3 | 0 | 14 | NaN | NaN | NaN | NaN | 225.7357 | 5.7057 |
| 17875 | 12.02 | Viktor | mid | 3 | 2 | 11 | NaN | NaN | NaN | NaN | 256.0661 | 8.3784 |
| 17876 | 12.02 | Kai'Sa | bot | 10 | 0 | 6 | NaN | NaN | NaN | NaN | 430.8408 | 10.7808 |
| 17877 | 12.02 | Nautilus | sup | 1 | 1 | 13 | NaN | NaN | NaN | NaN | 119.1291 | 1.0511 |
Since these are only 10 rows out of 124,360, we can safely drop them.
lol_cleaned = lol_cleaned.loc[lol_cleaned['dpm'].notna()]
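The `.loc[...notna()]` filter above can also be written with `dropna(subset=...)`; a small sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for lol_cleaned
df = pd.DataFrame({'dpm': [552.3, np.nan, 128.3], 'kills': [2, 1, 3]})

# Equivalent to df.loc[df['dpm'].notna()]
cleaned = df.dropna(subset=['dpm'])
print(len(cleaned))  # 2
```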
Transform the position column into a binary column, 1 if the position is support, otherwise 0.
lol_cleaned.loc[lol_cleaned['position']!='sup', ['position']] = 0
lol_cleaned.loc[lol_cleaned['position']=='sup', ['position']] = 1
lol_cleaned['position'] = lol_cleaned['position'].astype(int)
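The two `.loc` assignments plus the `astype` call can be collapsed into a single vectorized comparison; a sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame standing in for lol_cleaned
df = pd.DataFrame({'position': ['top', 'jng', 'mid', 'bot', 'sup']})

# One comparison replaces both .loc assignments: True/False -> 1/0
df['position'] = (df['position'] == 'sup').astype(int)
print(df['position'].tolist())  # [0, 0, 0, 0, 1]
```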
lol_cleaned
| patch | champion | position | kills | deaths | assists | dpm | damageshare | damagetakenperminute | vspm | earned gpm | cspm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.01 | Renekton | 0 | 2 | 3 | 2 | 552.2942 | 0.278784 | 1072.3993 | 0.9107 | 250.9282 | 8.0911 |
| 1 | 12.01 | Xin Zhao | 0 | 2 | 5 | 6 | 412.0841 | 0.208009 | 944.2732 | 1.6813 | 188.0210 | 5.1839 |
| 2 | 12.01 | LeBlanc | 0 | 2 | 2 | 3 | 499.4046 | 0.252086 | 581.6462 | 1.0158 | 208.2312 | 6.7601 |
| 3 | 12.01 | Samira | 0 | 2 | 4 | 2 | 389.0018 | 0.196358 | 463.8529 | 0.8757 | 239.4046 | 7.9159 |
| 4 | 12.01 | Leona | 1 | 1 | 5 | 6 | 128.3012 | 0.064763 | 475.0263 | 2.4168 | 101.8564 | 1.4711 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149225 | 12.23 | Jax | 0 | 4 | 0 | 5 | 450.5737 | 0.171729 | 608.3352 | 0.7762 | 331.7885 | 9.4826 |
| 149226 | 12.23 | Vi | 0 | 2 | 4 | 11 | 201.7660 | 0.076899 | 762.7897 | 1.3161 | 211.6198 | 4.9944 |
| 149227 | 12.23 | Ahri | 0 | 6 | 3 | 8 | 647.4128 | 0.246762 | 553.8695 | 2.2610 | 292.4747 | 7.9303 |
| 149228 | 12.23 | Varus | 0 | 7 | 0 | 12 | 954.3982 | 0.363768 | 292.1035 | 1.5186 | 351.4961 | 8.4702 |
| 149229 | 12.23 | Ashe | 1 | 2 | 1 | 13 | 369.5163 | 0.140841 | 269.4601 | 4.1170 | 162.3510 | 1.3498 |
124350 rows × 12 columns
Inspect the dtypes and missingness of the cleaned data:
lol_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 124350 entries, 0 to 149229
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   patch                 124260 non-null  float64
 1   champion              124350 non-null  object
 2   position              124350 non-null  int64
 3   kills                 124350 non-null  int64
 4   deaths                124350 non-null  int64
 5   assists               124350 non-null  int64
 6   dpm                   124350 non-null  float64
 7   damageshare           124350 non-null  float64
 8   damagetakenperminute  124350 non-null  float64
 9   vspm                  124350 non-null  float64
 10  earned gpm            124350 non-null  float64
 11  cspm                  124350 non-null  float64
dtypes: float64(7), int64(4), object(1)
memory usage: 12.3+ MB
Use every column other than position from lol_cleaned to predict whether a player's position is support.
# Get the training and testing datasets for X (features used to predict) and y (response)
X = lol_cleaned.drop(columns = 'position')
y = lol_cleaned['position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
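One note on the split: without a fixed `random_state`, `train_test_split` produces a different split on every run, so the scores below will vary slightly between executions. A sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Fixing random_state makes the split reproducible run to run
Xa, Xb, ya, yb = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
Xa2, Xb2, ya2, yb2 = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
print((Xa == Xa2).all())  # True
```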
from sklearn.tree import DecisionTreeClassifier
preproc = ColumnTransformer(
transformers = [
('cat_cols', OneHotEncoder(handle_unknown='ignore'), ['patch', 'champion'])
],
    # patch is ordinal, but its order carries no signal for this target,
    # so one-hot encoding is sufficient (see the note below)
remainder='passthrough'
)
pl = Pipeline([
('preprocessor', preproc),
('decision-tree', DecisionTreeClassifier(max_depth=2))
])
pl.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat_cols',
OneHotEncoder(handle_unknown='ignore'),
['patch', 'champion'])])),
('decision-tree', DecisionTreeClassifier(max_depth=2))])
Note: even though the 'patch' column contains ordinal values, the order of patches has no meaningful relationship with the position we are trying to predict; that is, a later patch does not make players more or less likely to play support. We therefore perform one-hot encoding rather than ordinal encoding.
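To make the encoding concrete, here is a small sketch (the toy patch values are hypothetical) of what `OneHotEncoder` does to the patch column: each distinct patch becomes its own indicator column, with no ordering implied.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical patch values; real data has one row per player
patches = np.array([[12.01], [12.02], [12.23], [12.01]])

enc = OneHotEncoder(handle_unknown='ignore')
onehot = enc.fit_transform(patches).toarray()  # .toarray() densifies

# 3 distinct patches -> 3 indicator columns; each row has exactly one 1
print(onehot.shape)  # (4, 3)
```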
# The score we got on training data
pl.score(X_train, y_train)
0.9940490231820034
# The score we got on testing data
pl.score(X_test, y_test)
0.9934701492537313
from sklearn.preprocessing import QuantileTransformer, FunctionTransformer
from sklearn.preprocessing import StandardScaler
# Define the function for function transformer
def k_a(df):
    # (kills + 1) / (assists + 1); the +1 guards against division by zero
    return ((df['kills']+1)/(df['assists']+1)).to_frame()
# initialize functiontransformer
k_a_trans = FunctionTransformer(k_a)
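A quick sanity check of the transformer on a hypothetical two-row frame — `k_a` maps kills and assists to a single smoothed kill/assist ratio column:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def k_a(df):
    # (kills + 1) / (assists + 1); the +1 guards against division by zero
    return ((df['kills'] + 1) / (df['assists'] + 1)).to_frame()

k_a_trans = FunctionTransformer(k_a)

# Hypothetical rows: a kill-heavy player and an assist-heavy player
toy = pd.DataFrame({'kills': [3, 0], 'assists': [1, 4]})
ratio = k_a_trans.transform(toy)
print(ratio.iloc[:, 0].tolist())  # [2.0, 0.2]
```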
# Get training and testing dataset for X (values use to predict) and Y (response)
X = lol_cleaned.drop(columns = 'position')
y = lol_cleaned['position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Create a pipeline for functiontransformer
k_a_pipe = Pipeline([
('to_k_a', k_a_trans)
])
# Fit and transform our training data and generate the model
preproc_final = ColumnTransformer(
transformers = [
('cat_cols', OneHotEncoder(handle_unknown='ignore'), ['patch', 'champion']),
('quantile', QuantileTransformer(n_quantiles = 100), ['vspm', 'earned gpm']),
    # quantile-transform to damp outliers, so the model focuses on the bulk of the data
('k_a', k_a_pipe, ['kills', 'assists'])
],
remainder='passthrough'
)
pl_final = Pipeline([
('preprocessor', preproc_final),
('tree', DecisionTreeClassifier(max_depth=2))
])
pl_final.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat_cols',
OneHotEncoder(handle_unknown='ignore'),
['patch', 'champion']),
('quantile',
QuantileTransformer(n_quantiles=100),
['vspm', 'earned gpm']),
('k_a',
Pipeline(steps=[('to_k_a',
FunctionTransformer(func=<function k_a at 0x7fc23814ea60>))]),
['kills', 'assists'])])),
('tree', DecisionTreeClassifier(max_depth=2))])
# Score we got on training data
pl_final.score(X_train, y_train)
0.9944564774506229
# Score we got on testing data
pl_final.score(X_test, y_test)
0.9942099845599588
GridSearchCV to search for the best hyperparameters for our final model¶from sklearn.model_selection import GridSearchCV
# Hyperparameters to try
hyperparameters = {
    'tree__max_depth': list(range(1, 20)),
'tree__min_samples_split': [2,5,10],
'tree__criterion':['gini', 'entropy']
}
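This grid is fairly large: 19 depths × 3 split thresholds × 2 criteria = 114 candidate settings, and with `cv=5` each is fit 5 times, so the search trains 570 trees. A quick way to check the size with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {
    'tree__max_depth': list(range(1, 20)),
    'tree__min_samples_split': [2, 5, 10],
    'tree__criterion': ['gini', 'entropy'],
}

n_candidates = len(ParameterGrid(hyperparameters))
print(n_candidates, n_candidates * 5)  # 114 candidates, 570 fits with cv=5
```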
# perform GridSearchCV to search for the best Hyperparameters
searcher = GridSearchCV(pl_final, param_grid=hyperparameters, cv=5)
searcher.fit(X_train, y_train)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat_cols',
OneHotEncoder(handle_unknown='ignore'),
['patch',
'champion']),
('quantile',
QuantileTransformer(n_quantiles=100),
['vspm',
'earned '
'gpm']),
('k_a',
Pipeline(steps=[('to_k_a',
FunctionTransformer(func=<function k_a at 0x7fc23814ea60>))]),
['kills',
'assists'])])),
('tree',
DecisionTreeClassifier(max_depth=5,
min_samples_split=10))]),
param_grid={'tree__criterion': ['gini', 'entropy'],
'tree__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19],
'tree__min_samples_split': [2, 5, 10]})
# the best hyperparameters got by GridSearchCV
searcher.best_params_
{'tree__criterion': 'gini',
'tree__max_depth': 5,
'tree__min_samples_split': 10}
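Rebuilding the pipeline with the best hyperparameters by hand (as done below) works, but it is worth knowing that `GridSearchCV` with the default `refit=True` already retrains the best candidate on the full training set, so `searcher.best_estimator_` could be used directly. A self-contained sketch on hypothetical toy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real training set
X_toy, y_toy = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [2, 5, 8]}, cv=3)
search.fit(X_toy, y_toy)

# With refit=True (the default), best_estimator_ is already retrained
# on all of X_toy and can be scored or used for prediction directly
best = search.best_estimator_
print(best.get_params()['max_depth'] in [2, 5, 8])  # True
```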
# Final model with the best hyperparameters got by GridSearchCV
preproc_final = ColumnTransformer(
transformers = [
('cat_cols', OneHotEncoder(handle_unknown='ignore'), ['patch', 'champion']),
('quantile', QuantileTransformer(n_quantiles = 100), ['vspm', 'earned gpm']),
    # quantile-transform to damp outliers, so the model focuses on the bulk of the data
('k_a', k_a_pipe, ['kills', 'assists'])
],
remainder='passthrough'
)
pl_final_hyp = Pipeline([
('preprocessor', preproc_final),
('tree', DecisionTreeClassifier(criterion = 'gini', max_depth = 5, min_samples_split = 10))
])
pl_final_hyp.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat_cols',
OneHotEncoder(handle_unknown='ignore'),
['patch', 'champion']),
('quantile',
QuantileTransformer(n_quantiles=100),
['vspm', 'earned gpm']),
('k_a',
Pipeline(steps=[('to_k_a',
FunctionTransformer(func=<function k_a at 0x7fc23814ea60>))]),
['kills', 'assists'])])),
('tree',
DecisionTreeClassifier(max_depth=5, min_samples_split=10))])
# Score we got on training data
pl_final_hyp.score(X_train, y_train)
0.9953142759108747
y_pred = pl_final_hyp.predict(X_test)
y_pred
array([0, 0, 0, ..., 0, 0, 1])
# Score we got on testing data
pl_final_hyp.score(X_test, y_test)
0.9947568193515183
Group X: players who picked a champion that League of Legends officially designates as a support champion
Group Y: players who picked a champion that is not officially designated as a support champion
reference URL: https://www.leagueoflegends.com/en-us/champions/
support_champion = ['Alistar', 'Anivia', 'Ashe',
'Bard', 'Braum', 'Fiddlesticks',
'Heimerdinger', 'Ivern', 'Janna',
'Karma', 'Kayle', 'Leona',
'Lulu', 'Lux', 'Morgana',
'Nami', 'Nautilus', 'Neeko',
'Orianna', 'Pyke', 'Rakan',
'Rell', 'Renata Glasc', 'Senna',
'Seraphine', 'Sona', 'Soraka',
'Tahm Kench', 'Taliyah', 'Taric',
'Thresh', 'Yuumi', 'Zilean',
'Zoe', 'Zyra']
from sklearn import metrics
# plot_confusion_matrix is deprecated in recent scikit-learn;
# ConfusionMatrixDisplay.from_estimator is the replacement
metrics.ConfusionMatrixDisplay.from_estimator(pl_final, X_test, y_test)
metrics.precision_score(y_test, y_pred)
0.9940274414850686
1 - metrics.precision_score(y_test, y_pred)
0.005972558514931392
metrics.recall_score(y_test, y_pred)
0.9799490770210058
1 - metrics.recall_score(y_test, y_pred)
0.020050922978994246
A high false negative rate (equivalently, a low recall) is bad: it means the model misses players who really are supports.
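Since the false negative rate drives the fairness analysis below, here is a hedged sketch (with hypothetical toy labels) of how it relates to recall: FNR = FN / (FN + TP) = 1 − recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 4 true supports, one of which is missed
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# confusion_matrix.ravel() returns tn, fp, fn, tp for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)
print(fnr)                               # 0.25
print(1 - recall_score(y_true, y_pred))  # 0.25, the same quantity
```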
results = X_test.copy()  # copy so we do not mutate X_test
results['is_support'] = results['champion'].apply(lambda x: 1 if x in support_champion else 0)
results['prediction'] = y_pred
results['position'] = y_test
(
results
.groupby('is_support')
.apply(lambda x: 1 - metrics.recall_score(x['position'], x['prediction']))
.plot(kind='bar', title='False Negative Rate by Champion Group')
)
results.groupby('is_support')['prediction'].mean().to_frame()
| prediction | |
|---|---|
| is_support | |
| 0 | 0.013301 |
| 1 | 0.801937 |
(
results
.groupby('is_support')
.apply(lambda x: metrics.accuracy_score(x['position'], x['prediction']))
.rename('accuracy')
.to_frame()
)
| accuracy | |
|---|---|
| is_support | |
| 0 | 0.997980 |
| 1 | 0.984313 |
Null Hypothesis: Our model is fair. Its accuracy for players who choose a support champion and players who choose a non-support champion is roughly the same, and any differences are due to random chance.
Alternative Hypothesis: Our model is unfair. Its accuracy for players who choose a support champion is lower than its accuracy for players who choose a non-support champion.
Evaluation Metric: Accuracy
Test statistic: Difference in accuracy (support minus non-support).
Significance level: 0.05.
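The test below shuffles the group labels to build the null distribution of accuracy differences. A generic, self-contained sketch of the idea (function name and toy data are hypothetical, not from the project):

```python
import numpy as np

def permutation_diff_test(groups, correct, n_reps=1000, seed=0):
    # groups: True for group X; correct: True where the model was right
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups, dtype=bool)
    correct = np.asarray(correct, dtype=bool)

    def diff(g):
        # accuracy inside group X minus accuracy outside it
        return correct[g].mean() - correct[~g].mean()

    obs = diff(groups)
    # shuffling group labels breaks any group/accuracy association
    null = np.array([diff(rng.permutation(groups)) for _ in range(n_reps)])
    return obs, null

# Hypothetical toy data where group X is noticeably less accurate
groups = np.array([True] * 20 + [False] * 20)
correct = np.array([True] * 12 + [False] * 8 + [True] * 19 + [False])
obs, null = permutation_diff_test(groups, correct)
p_value = (null <= obs).mean()  # one-sided: is X's accuracy lower?
```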
obs = results.groupby('is_support').apply(lambda x: \
metrics.accuracy_score(x['position'], x['prediction'])).diff().iloc[-1]
obs
-0.01366635231094171
diff_in_acc = []
for _ in range(1000):
s = (
results[['is_support', 'prediction', 'position']]
        # np.random.permutation avoids index-alignment problems when assigning
        .assign(is_support=np.random.permutation(results['is_support'].to_numpy()))
.groupby('is_support')
.apply(lambda x: metrics.accuracy_score(x['position'], x['prediction']))
.diff()
.iloc[-1]
)
diff_in_acc.append(s)
p-value:
(np.array(diff_in_acc) > obs).mean()
1.0